date()
## [1] "Mon Nov 16 22:40:47 2020"
As you can see, this file have been made and uploaded to GitHub.
# In order to keep the Rmd file clean, all the libraries are downloaded here. Keep scrolling.
library(MASS)
library(corrplot)
## corrplot 0.84 loaded
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::select() masks MASS::select()
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
The data to be analyze this week is called Boston. It comes from MASS package. The data is rather classic set of information. It consist data related to housing values in suburbs of Boston and several other parameters including air quality, crime rate and proportion of people of color at those suburbs. The dataset is rather old. It is first published in 1978.
data("Boston")
The str() method is called for checking the basic structure of the data. The data has 506 rows and 14 columns. The variable names and types are listed here:
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Here is the pairs-plot of the dataset. Since the data has quite many columns, it practically shows you nothing.
pairs(Boston)
It is better to use corrplot()-method for checking the data set. The tidyverse and corrplot packages were downloaded for this.
The correlations between different parameters are shown here. The larger and darker the circle is, the stronger the correlation is. Blue represents positive and red negative correlation.
cor_matrix<-cor(Boston) %>% round(digits=2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
The histograms for each variable can be drawn using ggplot2. Traditionally, the data set is used for demonstrating effects of different parameters to housing values.
ggplot(gather(Boston), aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x')
In this task, the data set “Boston” is standardized and crime rate is switched to categorical variable.
As one can see, the mean value for each parameter is zero after scaling.
## First, the data set is scaled and summary is printed
boston_scaled <- scale(Boston)
boston_scaled=as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# Then the crime rate is switched to categorical variable.
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
The final part of this task is to split the scaled data set into test and train data using 80-20 split.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
After the model is trained, the results it gives to the test set is less-biased estimator for the actual model accuracy.
First, the LDA-model is fitted to the data set. This time, the catergorical crime rate is used as target variable.
# These simple models can be applied using a single line of code.
lda.fit <- lda(crime ~ ., data = train)
A new function “lda-arrows” is defined for the task. The function draws the LDA bi-plot.
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
One can see, that is actually relatively easy to separate the high-crime rate areas
First, the correct classes are separated from the test set
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
The rest of the test data is fed into the trained LDA model. The cross tabulated results are printed.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 14 14 2 0
## med_low 0 12 9 0
## med_high 0 7 16 0
## high 0 0 0 28
As the results show, the model can predict the high crime rate suburbs with nearly 100% accuracy. However, the model truly struggles when it tries to decide, should a suburb belong to low or med_low category. In addition, there is some problems to find the difference between med_low and med_high suburbs.
First, let’s reload and standardize the Boston data set
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled=as.data.frame(boston_scaled)
Second, let’s build a k-mean clustering algorithm with the scaled boston data.
Let’s select 2 as the number of clusters. Since 14x14 pair plot is bit difficult to look at. Let’s draw three smaller ones.
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <-kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled[1:5], col = km$cluster)
pairs(boston_scaled[5:10], col = km$cluster)
pairs(boston_scaled[10:14], col = km$cluster)
The pair plots shows nice separation of two groups. The results are quite simillar as with the LDA algorithm. With LDA we saw that it is relatively easy to find the suburbs with high crime rate. But there was significant overlapping with suburbs that had low, med_low and med_high crime rate.
First, let’s perform the k-means clustering using 3 centers.
km2 <-kmeans(boston_scaled, centers = 3)
Then, let’s try to fit the LDA to the new k-means model output.
# These simple models can be applied using a single line of code.
lda.fit2 <- lda(km2$cluster ~ ., data = boston_scaled)
Now, let’s look the results. It seems that the model can predict detect the suburbs with high crime rate using this unsupervised learning method.
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(km2$cluster)
# plot the lda results
plot(lda.fit2, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
First part of the super bonus task
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.